<HTML>
<!--This file created 26.03.2005 18:46 Uhr by Claris Home Page version 3.0-->
<HEAD>
   <TITLE>Tight PVM Integration in Grid Engine</TITLE>
   <META NAME=GENERATOR CONTENT="Claris Home Page 3.0">
</HEAD>
<BODY BGCOLOR="#FFFFFF">
<P><FONT SIZE="+1"><B>Topic:</B></FONT></P>

<P>Loose and tight integration of the PVM library into SGE.</P>

<P><FONT SIZE="+1"><B>Author:</B></FONT></P>

<P>Reuti, <A HREF="mailto:reuti__at__staff.uni-marburg.de">reuti__at__staff.uni-marburg.de;</A>
Philipps-University of Marburg, Germany</P>

<P><FONT SIZE="+1"><B>Version:</B></FONT></P>

<P>1.0 -- 2005-03-26 Initial release, comments and corrections are
welcome</P>

<P><FONT SIZE="+1"><B>Contents:</B></FONT></P>

<UL>
   <LI>Advantages of a Tight Integration</LI>
   
   <LI>Prerequisites</LI>
   
   <LI>Loose Integration</LI>
   
   <LI>Additions for Tight Integration</LI>
   
   <LI>Tight Integration</LI>
   
   <LI>Nodes with more than one network interface</LI>
   
   <LI>Restrictions and Future Work</LI>
   
   <LI>References and Documents</LI>
</UL>

<P><FONT SIZE="+1"><B>Note:</B></FONT></P>

<P>This HOWTO complements the information contained in the
$SGE_ROOT/pvm directory of the Grid Engine distribution.</P>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Advantages of a Tight Integration</B></FONT></P>

<BLOCKQUOTE>The original included files for the PVM integration offer
   only a loose integration of PVM into SGE. This means, that the
   necessary daemons (which build up the PVM), will neither be under
   control of SGE, nor that all files on the slave nodes (created by
   PVM for its internal management) will be correctly deleted in case
   of a job abort. Also the accounting will not be correct, as the
   via rsh started PVM daemons on the slave nodes are not related in
   any way to SGE.
   
   <P>With Tight Integration on the other hand, you will have all
   this, plus that there is also no need to have rsh between the
   nodes enabled, as all startups of the daemons on the slave nodes
   are handled by the built-in qrsh of SGE.</P></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Prerequisites</B></FONT></P>

<BLOCKQUOTE><B>Configuration of SGE with qconf or the GUI</B>
   
   <P>You should already know how to change settings in SGE, like to
   setup and change a queue definition or the entries in the PE
   configuration. Additional information about queues and parallel
   interfaces you can get from the man pages "queue_conf" and
   "sge_pe" of SGE (make sure the SGE man pages are defined in your
   $MANPATH).</P>
   
   <P><B>Target platform</B></P>
   
   <P>This Howto targets the PVM version 3.4.5 on SGE 6.0. Some new
   platforms were added in the necessary build step while creating
   the supporting <I>start_pvm</I> and <I>stop_pvm</I> programs. The
   other three included programs are only examples, and not needed
   for the integration. For users of SGE 5.3, a backport is also
   available.</P>
   
   <P><B>PVM</B></P>
   
   <P>The PVM (Parallel Virtual Machine) is a framework from the Oak
   Ridge National Laboratory (<A HREF="http://www.csm.ornl.gov/pvm/pvm_home.html">http://www.csm.ornl.gov/pvm/pvm_home.html</A>)
   and provides an interface for parallel programs, which allows the
   MPMD (multiple program multiple data) paradigm. Before you start
   with the integration of PVM into SGE, you should already be
   familiar with the operation of PVM outside SGE, like starting the
   daemons and requesting information about the started virtual
   machine from the pvm-console. Although this is not directly
   necessary for the integration into SGE, it will ease the
   understanding of the applied configurations (and detection of
   failures in operation in case that something went wrong). There
   isn't any patch or modification necessary to the original
   distribution of PVM.</P>
   
   <P><B>Included setups and scripts</B></P>
   
   <P>The supplied archive in &#91;<A HREF="#[1] Archive sge-pvm-integration">1</A>&#93;
   will supersede the provided $SGE_ROOT/pvm. It contains modified
   scripts and programs of the original distribution of the PVM
   integration package in SGE, which enables now also the Tight
   Integration. So you may first save the original $SGE_ROOT/pvm,
   before you untar this package in the same location. In case that
   you are using (and will continue to use) the Loose Integration,
   you can still live with this new package, as by default it mimics
   the original behavior and your already existing configurations
   should continue to work without any change.</P>
   
   <P>Another short running program is provided in &#91;<A HREF="#[2] Archive pvm_hello">2</A>&#93;,
   which will allow you to observe the correct distribution of the
   spawned tasks and removal of files in /tmp.</P></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Loose Integration</B></FONT></P>

<BLOCKQUOTE>As there is only one distribution of $SGE_ROOT/pvm now,
   the necessary steps to install this new package will be the same
   as in the original integration package. After you set your working
   directory to be your $SGE_ROOT and untared the supplied archive in
   this location as your admin user for SGE, you will have to compile
   some helping programs. To do this, just change to
   $SGE_ROOT/pvm/src and set some necessary environment variables
   like:
   
   <BLOCKQUOTE><PRE>$ export PVM_ROOT=/home/reuti/pvm3
$ export PVM_ARCH=LINUX
$ export SGE_ROOT=/usr/sge</PRE></BLOCKQUOTE>
   
   <P>The correct locations you have of course to adjust to your
   custom installation. Be aware, that the PVM_ARCH is different by
   naming convention from the architecture SGE is detecting. On a
   Macintosh this leads e.g. to the situation, that SGE is referring
   it as "darwin", and PVM as "DARWIN" (for Linux it may be
   "lx24-x86" for SGE and "LINUX" for PVM). With these settings you
   can compile the programs by just issuing ./aimk in
   $SGE_ROOT/pvm/src. You should now got a directory (e.g. lx24-x86)
   and inside the compiled programs. The script ./install.sh executed
   in the same location will copy these files to the correct target
   in $SGE_ROOT/pvm/bin/&lt;arch&gt;, where startpvm.sh and
   stoppvm.sh is expecting them. These two programs <I>start_pvm</I>
   and <I>stop_pvm</I> (besides the three examples) will help the
   startpvm.sh and stoppvm.sh scripts to set up the PVM in a proper
   way. The <I>start_pvm</I> program will fork a child process, which
   will start the PVM, while the parent will watchout for the correct
   startup, and only be successful if the requested PVM could be
   established. The <I>stop_pvm</I> sends just a halt command to the
   PVM.</P>
   
   <P>Using the command line qconf or the qmon GUI, a sample setting
   for a PVM PE may look like:</P>
   
   <BLOCKQUOTE><PRE>$ qconf -sp pvm
pe_name           pvm
slots             46
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/pvm/startpvm.sh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args    /usr/sge/pvm/stoppvm.sh $pe_hostfile $host
allocation_rule   1
control_slaves    FALSE
job_is_first_task TRUE
urgency_slots     min</PRE></BLOCKQUOTE>
   
   <P>This setup is for SGE 6.x, where the PE must be specified in
   the cluster queue, in which it should be used in the entry
   "pe_list". For SGE 5.3 the queues to be used are defined in the PE
   already. To have a working example which will run for some
   seconds, we can compile the hello.c and hello_other.c in your home
   directoty (taken from the PVM User's Guide, you can find it in the
   supplied pvm_hello.tgz archive) with:</P>
   
   <BLOCKQUOTE><PRE>$ gcc -o hello hello.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3
$ gcc -o hello_other hello_other.c -I/home/reuti/pvm3/include -L/home/reuti/pvm3/lib/LINUX -lpvm3</PRE></BLOCKQUOTE>
   
   <P>Prepared with this, we can submit a simple PVM job and observe
   the bahvior on the nodes:</P>
   
   <BLOCKQUOTE><PRE>$ qsub -pe pvm 2 tester_loose.sh</PRE></BLOCKQUOTE>
   
   <P>The start script of the PE will start the daemons on the (in
   this case two) nodes, and the started program will spawn two
   additional processes on each <I>pvmd</I> (besides the main
   routine). You may implement such a logic in case that the main
   process <I>hello</I> will only collect the work of the
   <I>hello_other</I> tasks and not doing any active work on its
   own.</P>
   
   <BLOCKQUOTE><PRE>$ rsh node16 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
 2999   789  2999  \_ sge_shepherd-11233 -bg
 5340   789  5340  \_ sge_shepherd-11262 -bg
 5359  5340  5359      \_ /bin/sh /var/spool/sge/node16/job_scripts/11262
 5360  5359  5359          \_ ./hello
 5350     1  5341 /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11262.1.para16/hostfile
 5361  5350  5341  \_ /home/reuti/hello_other</PRE></BLOCKQUOTE>
   
   <P>On the slave node only the escaped <I>pvmd</I> and started
   hello_other can be found:&nbsp;</P>
   
   <BLOCKQUOTE><PRE>$ rsh node08 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  789     1   789 /usr/sge/bin/glinux/sge_execd
25950     1 25940 /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode08 1 c0a89711:8110 4080 2 c0a
25951 25950 25940  \_ /home/reuti/hello_other</PRE></BLOCKQUOTE>
   
   <P>All files will go in this case to the default directory for
   PVM: /tmp.</P>
   
   <BLOCKQUOTE><PRE>$ rsh node16 ls -lh /tmp               
total 68K
drwxr-xr-x    2 reuti    users        4.0K Mar 25 23:21 11262.1.para16
drwx------    2 root     root          16K Apr 23  2004 lost+found
-rw-------    1 reuti    users          19 Mar 25 23:21 pvmd.502
-rw-------    1 reuti    users         127 Mar 25 23:21 pvml.502
srwxr-xr-x    1 reuti    users           0 Mar 25 23:21 pvmtmp005350.0
$ rsh node08 ls -lh /tmp               
total 64K
drwx------    2 root     root          16K Apr 23  2004 lost+found
-rw-------    1 reuti    users          19 Mar 25 23:21 pvmd.502
-rw-------    1 reuti    users         126 Mar 25 23:21 pvml.502
srwxr-xr-x    1 reuti    users           0 Mar 25 23:21 pvmtmp025940.0</PRE></BLOCKQUOTE>
   
   <P>In case of a proper shutdown of the job and the PVM, all the
   files but the pvml.502 (where the used number reflects your user
   ID) will be deleted again. Unless you enable the out-commented
   PVM_VMID in the start/stop script of this PE (and use it also in
   your job script by setting: "PVM_VMID=$JOB_ID; export PVM_VMID"),
   you are limited to one PVM per user per machine. A setting of
   PVM_VMID to the SGE supplied $JOB_ID will append this value to all
   PVM generated files, and makes them so unique for each job.</P>
   
   <P>In case of a job abort via qdel, the started daemons will
   proper shutdown, but the started programs hello_other may continue
   to work.</P></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Additions for Tight Integration</B></FONT></P>

<BLOCKQUOTE>To achieve a Tight Integration, some changes were
   necessary to the already supplied Loose Integration scripts:
   
   <UL>
      <LI>The first started <I>pvmd</I> (which is also responsible
      for starting the other <I>pvmds</I> on the slave nodes) has to
      be started by a local qrsh. This is done by a change to
      start_pvm.c to distinguish between the two startup modes. In
      case of a Tight Integration, it will now assemble a call to rsh
      instead of only exec'ing the <I>pvmd</I> directly.<BR>
      <BR>
      </LI>
      
      <LI>The rsh-wrapper from the $SGE_ROOT/mpi was taken as a
      starting point and modified, to direct the rsh call in
      <I>start_pvm</I> to a qrsh. In addition, the temporary
      directory for PVM has to be set, to be the SGE created
      temporary directory. This is done by prefixing the call to
      <I>pvmd</I> with "env PVM_TMP=\$TMPDIR". This way PVM_TMP will
      be set on the target node, and not on the source node.<BR>
      <BR>
      </LI>
      
      <LI>Any built-in rsh command in PVM will be replaced, by
      setting already in startpvm.sh "PVM_RSH=rsh; export PVM_RSH",
      which will honor the SGE rsh-wrapper this way.<BR>
      <BR>
      </LI>
      
      <LI>The first via qrsh started <I>pvmd</I> will now in turn
      call qrsh again to start the <I>pvmds</I> on the slave nodes.
      Because the default behavior is to fork into daemon land, the
      rsh-wrapper will append the option "-f" to the <I>pvmd</I>
      command, if it discovers that the rsh-call is not a local one.
      So the rsh-wrapper behaves in fact in two different ways.</LI>
   </UL>
</BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Tight Integration</B></FONT></P>

<BLOCKQUOTE>Having already set up everything for the Loose
   Integration, you can switch to the Tight Integration by simply
   changing the definition of your PE and introducing -catch_rsh to
   the start procedure of the PE (and reversing the settings of
   control_slaves and job_is_first_task of TRUE and FALSE):
   
   <BLOCKQUOTE><PRE>$ qconf -sp pvm
pe_name           pvm
slots             46
user_lists        NONE
xuser_lists       NONE
start_proc_args   /usr/sge/pvm/startpvm.sh -catch_rsh $pe_hostfile $host /home/reuti/pvm3
stop_proc_args    /usr/sge/pvm/stoppvm.sh -catch_rsh $pe_hostfile $host
allocation_rule   1
control_slaves    TRUE
job_is_first_task FALSE
urgency_slots     min</PRE></BLOCKQUOTE>
   
   <P>The only thing left to do is now to set PVM_TMP like in
   tester_tight.sh in your job script, as in this location your
   executing program will look for information about the current
   PVM:</P>
   
   <BLOCKQUOTE>export PVM_TMP=$TMPDIR</BLOCKQUOTE>
   
   <P>After submitting the job in exact the same way as before (but
   this time taking the script tester_tight.sh in the qsub
   command):</P>
   
   <BLOCKQUOTE><PRE>$ qsub -pe pvm 2 tester_tight.sh</PRE></BLOCKQUOTE>
   
   <P>you should see a distribution to the head node of your job
   like:</P>
   
   <BLOCKQUOTE><PRE>$rsh node18 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  788     1   788 /usr/sge/bin/glinux/sge_commd
  791     1   791 /usr/sge/bin/glinux/sge_execd
 7147   791  7147  \_ sge_shepherd-11238 -bg
 7183  7147  7183  |   \_ /bin/sh /var/spool/sge/node18/job_scripts/11238
 7184  7183  7183  |       \_ ./hello
 7161   791  7161  \_ sge_shepherd-11238 -bg
 7162  7161  7162      \_ /usr/sge/utilbin/glinux/rshd -l
 7164  7162  7164          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node18/active_jobs
 7166  7164  7166              \_ /home/reuti/pvm3/lib/LINUX/pvmd3 /tmp/11238.1.para18/hostfile
 7185  7166  7166                  \_ /home/reuti/hello_other
 7158     1  7148 /usr/sge/bin/glinux/qrsh -V -inherit node18 env PVM_TMP=$TMPDIR /home/reuti/pvm3/l
 7163  7158  7148  \_ /usr/sge/utilbin/glinux/rsh -p 46806 node18 exec '/usr/sge/utilbin/glinux/qrsh
 7165  7163  7148      \_ &#91;rsh &lt;defunct&gt;&#93;
 7173     1  7166 /usr/sge/bin/glinux/qrsh -V -inherit node20 env PVM_TMP=$TMPDIR $PVM_ROOT/lib/pvmd
 7181  7173  7166  \_ /usr/sge/utilbin/glinux/rsh -p 53007 node20 exec '/usr/sge/utilbin/glinux/qrsh
 7182  7181  7166      \_ &#91;rsh &lt;defunct&gt;&#93;</PRE></BLOCKQUOTE>
   
   <P>The important thing is, that the daemon and the started
   programs <I>hello</I> and <I>hello_other</I> are under full SGE
   control. Also the created files in the usual /tmp will instead be
   placed in the SGE created private directory for this job:</P>
   
   <BLOCKQUOTE><PRE>$ rsh node18 ls -h /tmp/11238.1.para18                  
hostfile
pid.1.node18
pvmd.502
pvml.502
pvmtmp007166.0
qrsh_client_cache
rsh</PRE></BLOCKQUOTE>
   
   <P>The same can be observed on the (in this case only one) slave
   nodes:</P>
   
   <BLOCKQUOTE><PRE>$ rsh node20 ps -e f -o pid,ppid,pgrp,command --cols=100
  PID  PPID  PGRP COMMAND
  787     1   787 /usr/sge/bin/glinux/sge_commd
  790     1   790 /usr/sge/bin/glinux/sge_execd
15397   790 15397  \_ sge_shepherd-11238 -bg
15398 15397 15398      \_ /usr/sge/utilbin/glinux/rshd -l
15399 15398 15399          \_ /usr/sge/utilbin/glinux/qrsh_starter /var/spool/sge/node20/active_jobs
15400 15399 15400              \_ /home/reuti/pvm3/lib/LINUX/pvmd3 -s -d0x0 -nnode20 1 c0a89713:814c
15401 15400 15400                  \_ /home/reuti/hello_other
$ rsh node20 ls -h /tmp/11238.1.para20
pid.1.node20
pvmd.502
pvml.502
pvmtmp015400.0</PRE></BLOCKQUOTE>
   
   <P>Regardlessly, whether the job will terminate by the intended
   program end or aborted with qdel: all processes and intermediate
   files will be cleanly removed. And we don't have to honor
   PVM_VMID, as all files are already separated into different
   temporary directories.</P></BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Nodes with more than one network
interface</B></FONT></P>

<BLOCKQUOTE>If your cluster has two (or more) network interfaces in
   the nodes, you also have to set the start_proc_args to catch the
   call to hostname by -catch_hostname in case that you are not using
   the primary interface for PVM communication, as this will be used
   by the rsh_wrapper to distinguish between a local qrsh call and
   one to the slave nodes. In this hostname-wrapper, you must change
   the response from the original hostname call to give the name of
   the internal interface, which is also the one used in the SGE
   supplied $pe_hostfile. If you want to have the PVM communication
   on a complete other interface then SGE is aware of, you may have a
   look into $SGE_ROOT/mpi/startmpi.sh for mapping the $pe_hostfile
   supplied names also in $SGE_ROOT/pvm/startpvm.sh in a similar way
   (besides the necessary -catch_hostname).</BLOCKQUOTE>

<P>

<HR>

<BR>
<FONT SIZE="+1"><B>Restrictions and Future Work</B></FONT></P>

<BLOCKQUOTE>As SGE is granting the nodes/slots to your job, you
   shouldn't use any pvm_addhosts(), pvm_delhosts() or similar calls
   in your program, as this would violate the SEG scheduling of jobs
   to the cluster.
   
   <P>PVM uses internally a scheduler on it's own, to distribute the
   by pvm_spawn() initiated tasks to the nodes (unless you use the
   option to specify the target node directly in pvm_spawn() on your
   own &#91;according to the SGE granted nodes/slots of course&#93;).
   The built-in default is just a round robin scheme across all the
   granted nodes. So, setting a fixed allocation rule of 1 (or for
   dual CPU machines of 2) is the only safe setting, although you may
   spawn too many tasks to all the nodes, if your request to the SGE
   PE doesn't match the internal logic of your program.</P>
   
   <P>To allow a more flexible allocation of the nodes in SGE, a
   scheduler for PVM is in preparation, which will honor the SGE
   supplied allocation of slots per node. There is still only one
   daemon per node running, but the replacement scheduler will be
   responsible to get the correct distribution of the started
   processes in case of uneven granted slots on the nodes.</P></BLOCKQUOTE>

<P>

<HR>

<A NAME="References and Documents"></A><FONT SIZE="+1"><B>References
and Documents</B></FONT></P>

<BLOCKQUOTE><B>SGE-PVM Integration</B>
   
   <P><A NAME="[1] Archive sge-pvm-integration"></A>&#91;1&#93;
   Archive with all the scripts used in this Howto: <A HREF="pvm60.tgz">pvm60.tgz</A>
   &#91;for older installations using SGE 5.3: <A HREF="pvm53.tgz">pvm53.tgz</A>&#93;.<BR>
   <A NAME="[2] Archive pvm_hello"></A>&#91;2&#93; Archive with a
   small PVM program from the PVM User's Guide: <A HREF="pvm_hello.tgz">pvm_hello.tgz</A>.<BR>
   </P>
   
   <P><B>PVM</B></P>
   
   <P>The latest version of PVM and build instructions can be
   downloaded from (<A HREF="http://www.netlib.org/pvm3/">http://www.netlib.org/pvm3</A>).<BR>
   </P>
   
   <P><B>PVM documentation in general and tutorials</B></P>
   
   <P>For some general introduction to PVM and PVM-Programming, you
   can study the following documents:</P>
   
   <UL>
      <LI><A HREF="http://www.netlib.org/pvm3/book/pvm-book.ps">http://www.netlib.org/pvm3/book/pvm-book.ps</A></LI>
      
      <LI><A HREF="http://www.netlib.org/pvm3/refcard.ps">http://www.netlib.org/pvm3/refcard.ps</A></LI>
   </UL>
</BLOCKQUOTE>
</BODY>
</HTML>
